Powered by Rmarkdown.

1 Introduction

1.1 Identifier mappings:

NCT_ID →(JensenLab:Tagger)→ DOID

NCT_ID →(AACT)→ MeSH

NCT_ID →(NextMove:LeadMine)→ SMILES

SMILES →(PubChem)→ CID

CID →(PubChem)→ INCHIKEY

INCHIKEY →(ChEMBL)→ MOLECULE_CHEMBL_ID

MOLECULE_CHEMBL_ID →(ChEMBL)→ ACTIVITY_ID

ACTIVITY_ID →(ChEMBL)→ TARGET_CHEMBL_ID

TARGET_CHEMBL_ID →(ChEMBL)→ COMPONENT_ID

COMPONENT_ID →(ChEMBL)→ UNIPROT

ACTIVITY_ID →(ChEMBL)→ DOCUMENT_CHEMBL_ID

DOCUMENT_CHEMBL_ID →(ChEMBL)→ PUBMED_ID
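Each arrow in the chain above can be read as one join on a shared identifier. The following is a minimal sketch of that idea with data.table, not the report's actual code; the table names, column names, and all values are placeholders assumed for illustration.

```r
library(data.table)

# Toy mapping tables; names and values are illustrative placeholders only.
nct2smi <- data.table(nct_id = c("NCT_A", "NCT_A", "NCT_B"),
                      smiles = c("SMI_1", "SMI_2", "SMI_1"))
smi2cid <- data.table(smiles = c("SMI_1", "SMI_2"), cid = c(1L, 2L))
cid2ink <- data.table(cid = c(1L, 2L), inchikey = c("INK_1", "INK_2"))

# One inner join per arrow: NCT_ID -> SMILES -> CID -> INCHIKEY.
nct2cid <- merge(nct2smi, smi2cid, by = "smiles")
nct2ink <- merge(nct2cid, cid2ink, by = "cid")

print(nct2ink[, .(nct_id, smiles, cid, inchikey)])
```

Inner joins drop any identifier that fails to map at a given step, which is why counts shrink along the chain.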

1.2 Input files:

  • (CTTI AACT) aact_studies.tsv
  • (CTTI AACT) aact_drugs.tsv
  • (CTTI AACT) aact_descriptions.tsv
  • (NextMove LeadMine) aact_drugs_leadmine.tsv
  • (PubChem) aact_drugs_smi_pubchem_cid.tsv
  • (PubChem) aact_drugs_smi_pubchem_cid2ink.tsv
  • (ChEMBL) aact_drugs_ink2chembl.tsv
  • (ChEMBL) aact_drugs_chembl_activity_pchembl.tsv
  • (ChEMBL) aact_drugs_chembl_target_component.tsv
  • (IDG TCRD/Pharos) pharos_targets.tsv
  • (JensenLab Tagger) aact_descriptions_tagger_matches.tsv
  • (JensenLab Dictionary) diseases_entities.tsv

nct_id is the study ID.

## [1] "Thu Apr 11 15:09:30 2019"
library(readr)
library(data.table)
library(stringr)
library(plotly, quietly=TRUE)

2 Input studies and drugs

2.1 Studies

Read file of all studies in AACT.

## [1] "Total studies: 300214 ; unique NCT_IDs: 300214"
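A count like the one above could be produced along these lines. This is a sketch only: the inline text stands in for aact_studies.tsv, and the columns other than nct_id are assumptions about that file.

```r
library(data.table)

# Inline stand-in for aact_studies.tsv; the columns shown are assumptions.
tsv <- "nct_id\tstudy_type\tphase
NCT001\tInterventional\tPhase 1
NCT002\tObservational\tNA
NCT003\tInterventional\tPhase 2"

studies <- fread(text = tsv, sep = "\t", header = TRUE)

# Row count versus distinct study IDs; equal values mean one row per study.
msg <- sprintf("Total studies: %d ; unique NCT_IDs: %d",
               nrow(studies), uniqueN(studies$nct_id))
print(msg)
```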

2.1.1 Study references

References of type results_reference may offer stronger evidence and higher confidence than other reference types.

## [1] "references: 388031; NCT_IDs: 61208; PMIDs: 287758; results_references: 64880"

2.2 Drugs

Read file of all drugs in AACT.

  • id is the AACT INTERVENTION_ID, corresponding to one instance of a drug, dose, delivery, etc. in a study.
  • Note that one study may involve multiple drugs.
  • At this point a “drug” is imprecisely identified by name, generally one of many synonyms.
## [1] "Unique drug names: 91347 ; unique intervention IDs: 255077"

2.3 Studies: Interventional drug studies only

Select only Interventional studies (study_type) associated with drugs (via NCT_ID).

## [1] "Interventional studies: 237892 (79.2%)"
## [1] "Interventional drug studies: 124421 ; unique NCT_IDs: 124421"
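The selection described above amounts to a filter on study_type followed by a semi-join against the drugs table on NCT_ID. A minimal sketch on toy data, with all values invented for illustration:

```r
library(data.table)

# Toy stand-ins for the studies and drugs tables (values are invented).
studies <- data.table(
  nct_id = c("NCT001", "NCT002", "NCT003", "NCT004"),
  study_type = c("Interventional", "Observational",
                 "Interventional", "Interventional"))
drugs <- data.table(id = 1:3,
                    nct_id = c("NCT001", "NCT001", "NCT002"),
                    name = c("aspirin", "placebo", "metformin"))

# Step 1: keep interventional studies only.
interventional <- studies[study_type == "Interventional"]
# Step 2: keep those with at least one drug intervention (semi-join on nct_id).
drug_studies <- interventional[nct_id %in% drugs$nct_id]

print(sprintf("Interventional studies: %d (%.1f%%)", nrow(interventional),
              100 * nrow(interventional) / nrow(studies)))
print(sprintf("Interventional drug studies: %d ; unique NCT_IDs: %d",
              nrow(drug_studies), uniqueN(drug_studies$nct_id)))
```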
Drug studies and drugs, by phase
phase            N_studies  N_drugs
Early Phase 1         1574     2615
Phase 1              23603    48593
Phase 1/Phase 2       6663    13288
Phase 2              33910    68850
Phase 2/Phase 3       3305     6503
Phase 3              22988    49507
Phase 4              19593    36331
NA                   12785    29390

Drug studies and drugs, by overall_status
overall_status           N_studies  N_drugs
Active, not recruiting        6420    13962
Completed                    72053   145006
Enrolling by invitation        638     1060
Not yet recruiting            4138     8001
Recruiting                   16723    33973
Suspended                      463      945
Terminated                   10138    19618
Unknown status               10106    18463
Withdrawn                     3742     6969
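In the by-phase table, N_studies sums to 124421 and N_drugs sums to 255077, matching the interventional drug study count and the unique intervention ID count reported earlier. This suggests that N_drugs counts intervention IDs rather than distinct drug names. Under that assumption, a grouped count of this kind might be sketched as follows, on toy data:

```r
library(data.table)

# Toy data: one row per study, and one row per intervention ID (assumed layout).
drug_studies <- data.table(nct_id = c("NCT001", "NCT002", "NCT003"),
                           phase = c("Phase 1", "Phase 2", "Phase 2"))
drugs <- data.table(id = 1:5,
                    nct_id = c("NCT001", "NCT001", "NCT002", "NCT003", "NCT003"))

# Join interventions to studies, then count distinct studies and
# distinct intervention IDs within each phase.
d <- merge(drug_studies, drugs, by = "nct_id")
by_phase <- d[, .(N_studies = uniqueN(nct_id), N_drugs = uniqueN(id)),
              by = phase][order(phase)]
print(by_phase)
```

Replacing `phase` with `overall_status` in the `by` argument would give the second table in the same way.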

2.4 Drug studies by Phase and Status

2.5 Drug studies and drugs by start_year

## Warning: Ignoring 1 observations

## Warning: Ignoring 1 observations

3 NextMove LeadMine Chemical NER

AACT drug names are resolved to standard names and chemical structures, represented as SMILES. Note that one name may include multiple chemicals. With resolved structures, drugs can be counted in a cheminformatically rigorous way, as active pharmaceutical ingredients (APIs) rather than as name strings.

## [1] "Drug unique SMILES resolved by LeadMine: 4699 ; unique intervention IDs: 171741"

3.1 Chemical NER mentions

3.1.1 Totals by merging of synonyms to resolved structure (locally canonical SMILES)

Top 20 drugs by total mentions
smi2img N_mentions names
2637 Abraxane; PACLITAXEL; Paclitaxel; Taxol; abraxane; paclitaxel; taxol
2545 CYCLOPHOSPHAMIDE; Ciclophosphamide; Cyclophosphamid; Cyclophosphamide; ciclophosphamide; cyclophosphamide
2461 CISPLATIN; Cis-platinum; Cisplatin; Cisplatine; Cisplatinum; cis Platinum; cis-platinum; cisplatin; cisplatine; cisplatinum
2070 DEXAMETHASONE; Dexamethason; Dexamethasone; Dexamethosone; Maxitrol; OZURDEX; Oradexon; Ozurdex; dexamethason; dexamethasone; dexamethosone
2054 CARBOPLATIN; Carboplatin; Carboplatine; Paraplatin; carboplatin; carboplatine
1779 DOCETAXEL; Docetaxel; docetaxel
1625 METFORMIN; MetFORMIN; Metformin; Metformine; metformin; metformine
1540 GEMCITABINE; Gemcitabine; gemcitabine
1342 CAPECITABINE; Capecitabin; Capecitabine; XELODA; Xeloda; capecitabine; xeloda
1178 Cortancyl; Lodotra; Meticorten; Prednison; Prednisone; RAYOS; prednison; prednisone
1157 0xaliplatin; Eloxatin; OXALIPLATIN; OXAliplatin; Oxaliplatin; Oxaliplatine; eloxatin; oxaliplatin; oxaliplatine
1157 METHOTREXATE; Methotrexate; Metoject; methotrexate
1086 BUPIVACAINE; Bupivacain; Bupivacaine; EXPAREL; Exparel; SKY0402; bupivacain; bupivacaine
1044 ETOPOSIDE; Etoposid; Etoposide; etoposide
1027 ADOPORT; ADVAGRAF; Adoport; Advagraf; ENVARSUS; Envarsus; FK-506; FK506; PROGRAF; Prograf; Protopic; TACROLIMUS; Tacrolimus; tacrolimus
978 NORMAL SALINE; Normal Saline; Normal saline; normal salin; normal saline
977 LIDOCAINE; LMX 4; LMX4; Lidocain; Lidocaine; Lidoderm; Lignocain; Lignocaine; Oraqix; lidocain; lidocaine; lignocaine
908 CYTARABINE; Cytarabine; Cytosar; DepoCyt; DepoCyte; Depocyt; Depocyte; cytarabine; cytosar
903 COPEGUS; Copegus; REBETOL; RIBAVIRIN; Rebetol; Ribasphere; Ribavarin; Ribavirin; Ribavirine; Virazole; rebetol; ribavarin; ribavirin
846 Diprivan; PROPOFOL; Propofol; propofol
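Merging synonyms by resolved structure, as in the table above, can be sketched as a group-by on SMILES that counts mentions and concatenates the distinct names. The data and column names below are assumptions; the radix sort is chosen because it gives C-locale ordering (uppercase before lowercase), which matches the ordering seen in the names column.

```r
library(data.table)

# Toy NER mentions: one row per mention, keyed by placeholder SMILES.
mentions <- data.table(
  smiles = c("SMI_1", "SMI_1", "SMI_1", "SMI_2", "SMI_2"),
  name = c("Taxol", "paclitaxel", "paclitaxel", "Cisplatin", "cisplatin"))

# Per structure: total mentions, plus the distinct names joined by "; ".
top <- mentions[, .(N_mentions = .N,
                    names = paste(sort(unique(name), method = "radix"),
                                  collapse = "; ")),
                by = smiles][order(-N_mentions)]
print(head(top, 20))
```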

3.1.2 Chemical NER mentions resolved to structures (SMILES)

## [1] "Drugs (drug names) with resolved structure: 180555 / 197300 (91.5%)"

3.1.3 Chemical NER mentions by intervention ID.

## [1] "Mentions by intervention ID: 157862 / 171741 (91.9%)"

3.1.4 Chemical NER mentions by trial (NCT ID).

## [1] "Mentions by study: 92966 / 99647 (93.3%)"

3.1.5 Chemical NER mentions by drug, i.e. name in AACT.

## [1] "Mentions by drug name: 11108 / 58297 (19.1%)"

4 PubChem:

4.1 Intervention IDs to CIDs from PubChem (via SMILES)

## [1] "PubChem SMILES2CID hits: 3933 / 4540 (86.6%)"
## [1] "Intervention IDs mapped to PubChem CIDs (via SMILES): 153342"
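Coverage lines of the form "hits / total (percent)" recur throughout this report. A small helper such as the hypothetical `coverage()` below (not part of the original code) would keep the formatting consistent; it is shown reproducing the SMILES2CID line from the reported counts.

```r
# Hypothetical helper: format a hit count against a total with a percentage.
coverage <- function(label, hits, total) {
  sprintf("%s: %d / %d (%.1f%%)", label, hits, total, 100 * hits / total)
}

line <- coverage("PubChem SMILES2CID hits", 3933L, 4540L)
print(line)
```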

4.2 InChIKeys from PubChem (via CIDs)

## [1] "PubChem CIDs with InChIKeys: 3783"

5 IDG/TCRD:

For Target Development Level (TDL) and other metadata.

6 ChEMBL:

6.1 ChEMBL molecule IDs, and properties (via InChIKeys)

A possible alternative would be to map via PubChem CIDs using UniChem instead of InChIKeys.

## [1] "ChEMBL compounds mapped via InChIKeys: 3316"

6.2 ChEMBL activities for mapped compounds

Select only activities that have pChEMBL values, as a filter for relevance to protein targets and for confidence.

## [1] "ChEMBL activities: 127943"
## [1] "ChEMBL activities molecules: 2302 ; canonical_smiles: 2302 ; targets: 3877 ; documents: 16959"
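The pChEMBL filter can be sketched as dropping rows whose value is missing, then counting distinct entities. The column names follow ChEMBL naming conventions but are assumed here rather than read from aact_drugs_chembl_activity_pchembl.tsv, and the IDs are placeholders.

```r
library(data.table)

# Toy activities; NA marks an activity without a pChEMBL value.
act <- data.table(
  activity_id = 1:4,
  molecule_chembl_id = c("MOL_A", "MOL_A", "MOL_B", "MOL_C"),
  target_chembl_id = c("TGT_1", "TGT_2", "TGT_1", "TGT_3"),
  pchembl_value = c(6.5, NA, 7.2, NA))

# Keep only activities with a pChEMBL value.
act_p <- act[!is.na(pchembl_value)]

print(sprintf("ChEMBL activities: %d", nrow(act_p)))
print(sprintf("molecules: %d ; targets: %d",
              uniqueN(act_p$molecule_chembl_id),
              uniqueN(act_p$target_chembl_id)))
```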

6.2.1 Activity and molecule counts by assay types

Activity and molecule counts by assay types
assay_type         N_molecule  N_activity
F:Functional             1828       73811
B:Binding                1831       49891
A:ADMET                   759        4058
P:Physicochemical          44         120
T:Toxicity                 28          59
U:Unclassified              3           4

6.3 ChEMBL targets (via activities)

## [1] "ChEMBL target proteins: 3157"
## [1] "ChEMBL target proteins mapped to TCRD (human): 1805"

6.4 ChEMBL targets by organism:

## [1] "Organisms: 187"
Targets by organism (top 10)
organism N_targets Types
Homo sapiens 1806 CHIMERIC PROTEIN; PROTEIN COMPLEX; PROTEIN COMPLEX GROUP; PROTEIN FAMILY; PROTEIN-PROTEIN INTERACTION; SELECTIVITY GROUP; SINGLE PROTEIN
Rattus norvegicus 529 PROTEIN COMPLEX; PROTEIN COMPLEX GROUP; PROTEIN FAMILY; SELECTIVITY GROUP; SINGLE PROTEIN
Mus musculus 238 CHIMERIC PROTEIN; PROTEIN COMPLEX; PROTEIN COMPLEX GROUP; PROTEIN FAMILY; SINGLE PROTEIN
Bos taurus 98 PROTEIN COMPLEX; PROTEIN COMPLEX GROUP; PROTEIN FAMILY; SINGLE PROTEIN
Sus scrofa 36 PROTEIN COMPLEX; PROTEIN FAMILY; SINGLE PROTEIN
Cavia porcellus 26 SINGLE PROTEIN
Escherichia coli K-12 19 PROTEIN COMPLEX; PROTEIN FAMILY; SINGLE PROTEIN
Oryctolagus cuniculus 18 SINGLE PROTEIN
Escherichia coli 17 PROTEIN COMPLEX; SINGLE PROTEIN
Mycobacterium tuberculosis 17 SINGLE PROTEIN

6.5 Human single-protein targets only, by IDG family.

## [1] "Human targets: 1806"
idgFamily          N
Kinase           405
Enzyme           330
GPCR             158
None             120
IC                64
Transporter       53
Epigenetic        35
NR                28
TF                20
TF; Epigenetic     3
## [1] "Human single-protein targets: 1216 ; unique UniProts: 1216"
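The restriction to human single-protein targets might look like the following two-step filter. The organism and target_type values match those shown in the organism table above; the table layout, the uniprot column name, and the accession values are assumptions.

```r
library(data.table)

# Toy targets table with placeholder IDs and accessions.
targets <- data.table(
  target_chembl_id = c("TGT_1", "TGT_2", "TGT_3", "TGT_4"),
  organism = c("Homo sapiens", "Homo sapiens",
               "Rattus norvegicus", "Homo sapiens"),
  target_type = c("SINGLE PROTEIN", "PROTEIN COMPLEX",
                  "SINGLE PROTEIN", "SINGLE PROTEIN"),
  uniprot = c("ACC_1", "ACC_2", "ACC_3", "ACC_4"))

# Human targets of any type, then single proteins only.
human <- targets[organism == "Homo sapiens"]
human_sp <- human[target_type == "SINGLE PROTEIN"]

print(sprintf("Human targets: %d", nrow(human)))
print(sprintf("Human single-protein targets: %d ; unique UniProts: %d",
              nrow(human_sp), uniqueN(human_sp$uniprot)))
```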

6.6 ChEMBL targets by IDG TDL:

## [1] "   Tchem:    767" "   Tclin:    342" "    Tbio:    105"
## [4] "   Tdark:      2"

7 JensenLab Tagger Diseases NER

Disease NER uses the JensenLab DOID entities dictionary and is run on descriptions from the AACT detailed_descriptions table.

Likely false positives, manually removed:

7.1 Disease mention totals by merging to resolved Disease Ontology term (DOID).

Top 20 diseases by total mentions
doid N_mentions terms
DOID:162 28596 CANCER; CANcer; Cancer; Malignant Tumor; Malignant neoplasm; Malignant tumor; Primary Cancer; Primary cancer; cancer; malignant Tumor; malignant neoplasm; malignant tumor; primary cancer
DOID:9351 17274 DIABETES; DIABETES MELLITUS; DIAbetes; DIabetes; Diabetes; Diabetes Mellitus; Diabetes mellitus; diabetes; diabetes Mellitus; diabetes mellitus; diabetes-mellitus
DOID:6713 16632 CVA; Cerebrovascular Accident; Cerebrovascular Disease; Cerebrovascular accident; Cerebrovascular disease; STROKE; STRokE; Stroke; cerebro- vascular disease; cerebro-vascular disease; cerebrovascul…
DOID:2030 12084 ANXIETY; Anxiety; Anxiety Disorder; Anxiety state; anxiety; anxiety disorder; anxiety state; anxiety syndrome; anxiety-state
DOID:1612 10583 BREAST CANCER; BReast CAncer; BReast Cancer; Breast Cancer; Breast cancer; Breast tumor; Breast-cancer; Primary breast cancer; breast Cancer; breast caNcEr; breast cancer; breast tumor; breast-canc…
DOID:2841 10021 ASTHMA; Asthma; BHR; Bronchial hyper-reactivity; Bronchial hyperreactivity; EIA; Exercise-induced asthma; asthma; bronchial hyper reactivity; bronchial hyper-reactivity; bronchial hyperreactivity; …
DOID:3083 9782 CHRONIC OBSTRUCTIVE PULMONARY DISEASE; COLD; COPD; COPd; Chronic Obstructive Lung Disease; Chronic Obstructive Lung disease; Chronic Obstructive Pulmonary Disease; Chronic Obstructive Pulmonary dis…
DOID:9970 9303 OBESITY; OBesity; Obesity; obEsity; obe-sity; obesity
DOID:10763 9144 HBP; HTN; HYPERTENSION; High Blood Pressure; High blood pressure; High-blood pressure; Hypertension; Hypertensive disease; high blood Pressure; high blood pressure; high blood-pressure; htn; hyper-…
DOID:3393 6816 C-HD; CAD; CHD; CORONARY ARTERY DISEASE; CORONARY SYNDROME; CORONARY syndrome; ChD; Coronary ARtery DIsease; Coronary Artery Disease; Coronary Disease; Coronary Heart Disease; Coronary Heart diseas…
DOID:0060145 6115 ANALGESIA; Analgesia; analgeSia; analgesia
DOID:9352 5848 Diabetes Mellitus Type 2; Diabetes Mellitus Type II; Diabetes Mellitus type 2; Diabetes Mellitus, Type II; Diabetes mellitus Type 2; Diabetes mellitus non-insulin-dependent; Diabetes mellitus type …
DOID:10283 5056 Familial Prostate Cancer; HPC; PRostate Cancer; Prostate CAncer; Prostate Cancer; Prostate cancer; Prostatic cancer; hereditary prostate cancer; prostate Cancer; prostate cancer; prostate-cancer; p…
DOID:8469 4985 FLU; Flu; Influenza; flu; influenza
DOID:225 4962 SYNDROME; Syndrome; syn drome; syndrome
DOID:3908 4959 NSCLC; Non Small Cell Lung Cancer; Non Small Cell Lung Carcinoma; Non Small Cell Lung cancer; Non small cell lung cancer; Non small-cell lung cancer; Non- small cell lung cancer; Non-Small Cell Lun…
DOID:784 4841 CKD; CKF; CRD; CRF; Chronic Kidney Disease; Chronic Kidney disease; Chronic Kidney failure; Chronic Renal Disease; Chronic kidney disease; Chronic kidney failure; Chronic renal disease; chronic Kid…
DOID:5419 4689 SCHIZOPHRENIA; Schizophrenia; schizophrenia
DOID:684 3836 HCC; HEPATOCELLULAR CARCINOMA; Hepatocellular Carcinoma; Hepatocellular carcinoma; Hepatoma; hcc; hepato-cellular carcinoma; hepatocellular Carcinoma; hepatocellular carcinoma; hepatoma
DOID:5844 3664 Heart Attack; Heart attack; MYOCARDIAL INFARCTION; Myocardial Infarct; Myocardial Infarction; Myocardial infarct; Myocardial infarction; heart attack; myo-cardial infarction; myocardiaL infARction;…

7.2 Disease mentions by study.

Sort synonym terms by frequency.

Disease mentions by study (Random sample of studies)
nct_id doid N_mentions disease_terms
NCT00300742 DOID:1574 2 alcohol abuse
NCT00300742 DOID:0050741 1 alcohol dependence
NCT00300742 DOID:8670 1 eating disorder
NCT00598182 DOID:1094 3 ADHD
NCT01591681 DOID:9993 1 hypoglycemia
NCT01661010 DOID:5223 4 infertility;Infertility
NCT01859234 DOID:0070004 2 myeloma
NCT01859234 DOID:9538 1 Multiple Myeloma
NCT01859234 DOID:2355 1 anaemia
NCT02311036 DOID:6713 3 stroke
NCT02311036 DOID:0060046 1 aphasia
NCT02498067 DOID:526 1 HIV infection
NCT03510793 DOID:7693 1 abdominal aortic aneurysm
NCT03510793 DOID:326 1 Ischemia
NCT03528473 DOID:1612 3 breast cancer
NCT03528473 DOID:870 1 peripheral neuropathy
NCT03780439 DOID:13148 1 urinary tract infection
NCT03780439 DOID:10534 1 gastric cancer
NCT03780439 DOID:8437 1 intestinal obstruction
NCT03780439 DOID:552 1 pneumonia
NCT03780439 DOID:162 1 Cancer
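Sorting terms by frequency within each study and DOID can be sketched in two passes: count each distinct term, order by descending count, then collapse. The example reuses the NCT01661010 row from the table above; the 3-to-1 split between the two spellings is an assumption, since the table gives only the total of 4.

```r
library(data.table)

# One row per tagger match; the per-term split is assumed for illustration.
m <- data.table(nct_id = "NCT01661010", doid = "DOID:5223",
                term = c("infertility", "infertility",
                         "infertility", "Infertility"))

# Pass 1: count each distinct term, most frequent first within study and DOID.
term_counts <- m[, .N, by = .(nct_id, doid, term)][order(nct_id, doid, -N)]

# Pass 2: total mentions and frequency-ordered terms per study and DOID.
by_study <- term_counts[, .(N_mentions = sum(N),
                            disease_terms = paste(term, collapse = ";")),
                        by = .(nct_id, doid)]
print(by_study)
```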

8 Identify and enumerate drug, disease, target combinations associated by study (NCT_ID)

And include references.
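One way to enumerate such combinations is to join per-study disease mentions and per-study drugs on NCT_ID, then join drugs to targets. The sketch below assumes three already-deduplicated mapping tables with placeholder IDs; it counts supporting studies per disease and target pair only as an illustration, and does not represent the report's scoring.

```r
library(data.table)

# Placeholder mapping tables (all IDs invented).
study_disease <- data.table(nct_id = c("NCT_A", "NCT_A", "NCT_B"),
                            doid = c("DOID_1", "DOID_2", "DOID_1"))
study_drug <- data.table(nct_id = c("NCT_A", "NCT_B"),
                         molecule_chembl_id = c("MOL_X", "MOL_Y"))
drug_target <- data.table(molecule_chembl_id = c("MOL_X", "MOL_X", "MOL_Y"),
                          uniprot = c("ACC_1", "ACC_2", "ACC_1"))

# Study -> (disease, drug), then drug -> target; many-to-many by design.
sdd <- merge(study_disease, study_drug, by = "nct_id")
combos <- merge(sdd, drug_target, by = "molecule_chembl_id",
                allow.cartesian = TRUE)

# Number of distinct studies linking each disease to each target.
pairs <- combos[, .(N_studies = uniqueN(nct_id)), by = .(doid, uniprot)]
print(pairs[order(-N_studies)])
```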

8.1 Aggregate disease mentions by study (NCT_ID)

9 Aggregating, scoring and ranking disease, target associations.

Evidence weighted by: